(57) A set of words of a natural language is grouped
by automatically obtaining suffix relation data that indi-
cate a relation value for each of a set of relationships
between suffixes that occur in the natural language,
and, then, by automatically clustering the words in the
set, using the relation values from the suffix relation
data, to obtain group data indicating groups of words.
Two or more words in a group have suffixes as in one of 
the relationships and, preceding the suffixes, equivalent 
substrings. The relationships can be pairwise relation- 
ships, and the relation value can indicate the number of 
occurrences of a suffix pair. The suffix relation data can 
be obtained using an inflectional lexicon. Complete link 
clustering can be used.

Description

Field of the Invention 

[0001] The invention relates to grouping words,
some of which include equivalent substrings. 

Background and Summary of the Invention 

[0002] Some conventional techniques group related
words using some type of manual choice. For example,
Debili, F., Analyse Syntaxico-Semantique Fondee Sur
Une Acquisition Automatique de Relations Lexicales-
Semantiques, Doctoral Thesis, Univ. Paris XI, Jan. 26,
1982, pp. 174-223, discloses such a technique to obtain
families of words. The Debili thesis discloses that each
word is truncated by removing suffixes (and possibly
prefixes) to obtain a stem ("radical"). Binary correlation
matrices of suffixes are created by examining an
automatically produced correlation matrix of suffixes and
manually produced compatibility and incompatibility
matrices of suffixes, and are corrected by inserting a
zero for suffixes that are not compatible. The suffix
matrices and the stems can then be used to automatically
obtain families of words, with certain manual
corrections.

[0003] In contrast, other conventional techniques
automatically group words without manual intervention.
Adamson, G.W., and Boreham, J., "The use of an
Association Measure Based on Character Structure to Identify
Semantically Related Pairs of Words and Document
Titles", Information Storage and Retrieval, Vol. 10,
1974, pp. 253-260, disclose such an automatic word
classification technique based on comparison of pairs
of consecutive characters, called digrams. The technique
computes a similarity coefficient between pairs of
words based on the number of digrams common to the
words and on the sum of the total numbers of digrams in
the words, to obtain a matrix of similarity coefficients for
all pairs of words. The matrix is then used to cluster the
words by the method of single linkage, to produce a
numerically stratified hierarchy of clusters.
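
The digram comparison described above can be sketched as follows. This is a minimal illustration and not the cited authors' code: it assumes the coefficient has the Dice form (twice the number of shared digrams divided by the sum of the digram counts of the two words) and counts each distinct digram once. The function names are hypothetical.

```python
def digrams(word):
    """Return the set of distinct consecutive character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def digram_similarity(w1, w2):
    """Dice-style coefficient: 2 * shared digrams / total digrams.

    Assumption: each distinct digram is counted once per word.
    """
    d1, d2 = digrams(w1), digrams(w2)
    total = len(d1) + len(d2)
    if total == 0:
        return 0.0  # neither word is long enough to contain a digram
    return 2 * len(d1 & d2) / total


if __name__ == "__main__":
    print(digram_similarity("statistics", "statistical"))
```

A matrix of such coefficients over all word pairs is what a single linkage clustering step would then consume.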
[0004] Lennon, M., Peirce, D.S., Tarry, B.D., and
Willett, P., "An evaluation of some conflation algorithms
for information retrieval", Journal of Information
Science, Vol. 3, 1981, pp. 177-183, describe stemming
algorithms that reduce all words with the same root to a
single form by stripping each word of its derivational and
inflectional affixes. If prefixes are not removed, the
procedure conflates all words with the same stem. Lennon
et al. describe an evaluation to determine whether the
reduction in implementation costs for machine processing
algorithms is achieved at the expense of a decrease
in conflation performance, when compared to algorithms
based on manual evaluation of possible suffixes.
They conclude that there is relatively little difference
despite the different ways algorithms are developed,
and that simple, fully automated methods perform as
well for English language information retrieval as
procedures which involve a large degree of manual
involvement in their development.

[0005] The invention addresses a basic problem 
that arises in grouping related words. Conventional 
techniques exhibit a tension between accuracy and 
speed: Manual techniques can be used to group words 
very accurately, but are complex and tedious. Automatic 
techniques, on the other hand, can be very fast, but pro- 
duce groupings that are not as generally accurate as 
can be obtained manually. 

[0006] The invention is based on the discovery of a
new automatic technique for grouping words that
alleviates the tension between accuracy and speed. The new
technique automatically obtains suffix relation data
indicating a relation value for each of a set of relationships
between suffixes that occur in a natural language; the
relation value for a relationship could, for example, be its
frequency of occurrence in a set of words from the
natural language. The new technique then performs
automatic clustering of a set of words using the relation
values from the suffix relation data to obtain groups of
words, where two or more words in a group have
suffixes as in one of the relationships and, preceding the
suffixes, equivalent substrings.
[0007] The new technique can be implemented for
pairwise relationships between suffixes, with the
relation value of each suffix pair being the number of pairs
of words that are related to each other by the pair of
suffixes. Automatic clustering can then be performed with
the pairwise similarity between words being the greatest
relation value of the suffix pairs, if any, that relate the
words to each other. Complete link clustering can be
used. The new technique can be implemented using a
lexicon, such as an inflectional lexicon, to automatically
obtain a word list and then to use the word list in
automatically obtaining suffix pair data. The suffix pair data
can indicate pairs of suffixes that relate words to each
other and, for each pair of suffixes, a relation value
indicating a number of times the suffix pair occurs in the
word list. The suffix pair data can further indicate, for
each suffix in a pair, a part of speech, and the relation
value can accordingly indicate the number of times the
suffixes in the suffix pair occur in the word list with the
indicated parts of speech.
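
As an illustration of the pairwise similarity just described, the sketch below takes a table of suffix pair counts and returns the greatest count among the suffix pairs that relate two words. It assumes, for simplicity, that the substrings preceding the suffixes are identical rather than merely equivalent, and it ignores parts of speech. The function name and the sample counts are hypothetical.

```python
def pairwise_similarity(w1, w2, pair_counts):
    """Greatest relation value of any suffix pair relating w1 and w2.

    pair_counts maps (suffix_a, suffix_b) to a count. Returns 0 when
    no known suffix pair relates the two words.
    """
    best = 0
    limit = min(len(w1), len(w2))
    for p in range(1, limit + 1):
        if w1[:p] != w2[:p]:
            break  # the words no longer share the first p characters
        pair = (w1[p:], w2[p:])
        # Look the pair up in either order.
        best = max(best,
                   pair_counts.get(pair, 0),
                   pair_counts.get(pair[::-1], 0))
    return best


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {("able", "ingly"): 3, ("rable", "ringly"): 1}
    print(pairwise_similarity("deplorable", "deploringly", counts))
```

Every split point that leaves a shared leading substring is tried, so a word pair related by several candidate suffix pairs receives the largest of their counts.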

[0008] A representative for each group of words
indicated by the group data can also be automatically
obtained, such as the shortest word in the group.
Further, a data structure, such as a finite state transducer
(FST), can be automatically produced that can be
accessed with a word in a group to obtain the group's
representative. The data structure can also be
accessed with a group's representative to obtain a list of
the words in the group.

[0009] The new technique can further be implemented
in a system that includes memory and a processor
that automatically obtains the suffix relation data
and automatically clusters the set of words to obtain the
group data, storing the suffix relation data and the group
data in memory. The processor can also automatically
produce a data structure as described above and provide
it to a storage medium access device for storage on
a storage medium or to another machine over a network.

[0010] In comparison with conventional techniques
for grouping words with manual choice or other manual
involvement, the new technique is advantageous
because it is automatic and therefore can be performed
quickly. In addition, with appropriate clustering
techniques, the new technique can approach the accuracies
obtainable with manual techniques.
[0011] In comparison with conventional automatic
techniques for grouping words, the new technique is
significantly more accurate. Indeed, when applied to the
problem of stemming, the new technique is also more
accurate than conventional semi-automatic stemmers
that rely on a list of suffixes, a set of rules, and a list of
exceptions.

[0012] The new technique is also advantageous
because it can be readily applied to additional
languages for which inflectional lexicons are available. A
language's inflectional lexicon can be used to
automatically obtain suffix pairs with relation values.
[0013] The new technique is also advantageous 
because it can be implemented to use complete words 
as group representatives. In comparison with tech- 
niques that use substrings to represent groups, this is 
advantageous because it avoids ambiguous represent-
atives that could represent more than one group.
[0014] The following description, the drawings, and 
the claims further set forth these and other aspects, 
objects, features, and advantages of the invention.

Brief Description of the Drawings 

[0015] 

Fig. 1 is a schematic flow diagram showing how 
word groups can be obtained using suffix relation 
data. 

Fig. 2 is a flow chart showing general acts in obtaining
word groups by automatically obtaining suffix
relation data and by automatically clustering a set
of words.

Fig. 3 is a schematic diagram showing components
of a system that can perform the general acts in Fig. 2.

Fig. 4 is a schematic diagram of a system in which
the general acts in Fig. 2 have been implemented.

Fig. 5 is a flow chart showing how the system of Fig.
4 implements acts as in Fig. 2.

Fig. 6 is a flow chart showing in greater detail how
suffix pairs are obtained in Fig. 5.

Fig. 7 is a flow chart showing in greater detail how
words are clustered in Fig. 5.

Detailed Description of the Invention 

A. Conceptual Framework 

[0016] The following conceptual framework is helpful
in understanding the broad scope of the invention,
and the terms defined below have the indicated
meanings throughout this application, including the claims.

[0017] The term "data" refers herein to physical
signals that indicate or include information. When an item
of data can indicate one of a number of possible
alternatives, the item of data has one of a number of "values".
For example, a binary item of data, also referred to as a
"bit", has one of two values, interchangeably referred to
as "1" and "0" or "ON" and "OFF" or "high" and "low".
[0018] The term "data" includes data existing in any
physical form, and includes data that are transitory or
are being stored or transmitted. For example, data could
exist as electromagnetic or other transmitted signals or
as signals stored in electronic, magnetic, or other form.
[0019] "Circuitry" or a "circuit" is any physical
arrangement of matter that can respond to a first signal
at one location or time by providing a second signal at
another location or time. Circuitry "stores" a first signal
when it receives the first signal at one time and, in
response, provides substantially the same signal at
another time. Circuitry "transfers" a first signal when it
receives the first signal at a first location and, in
response, provides substantially the same signal at a
second location.

[0020] A "data storage medium" or "storage
medium" is a physical medium that can store data.
Examples of data storage media include magnetic
media such as diskettes, floppy disks, and tape; optical
media such as laser disks and CD-ROMs; and
semiconductor media such as semiconductor ROMs and RAMs.
As used herein, "storage medium" covers one or more
distinct units of a medium that together store a body of
data. For example, a set of diskettes storing a single
body of data would together be a storage medium.
[0021] A "storage medium access device" is a
device that includes circuitry that can access data on a
data storage medium. Examples include drives for
accessing magnetic and optical data storage media.
[0022] "Memory circuitry" or "memory" is any
circuitry that can store data, and may include local and
remote memory and input/output devices. Examples
include semiconductor ROMs, RAMs, and storage
medium access devices with data storage media that
they can access.

[0023] A "data processor" or "processor" is any
component or system that can process data, and may
include one or more central processing units or other
processing components.

[0024] A processor performs an operation or a
function "automatically" when it performs the operation or
function independent of concurrent human intervention
or control.

[0025] Any two components are "connected" when
there is a combination of circuitry that can transfer
signals from one of the components to the other. For
example, two components are "connected" by any
combination of connections between them that permits
transfer of signals from one of the components to the
other.

[0026] A "network" is a combination of circuitry
through which a connection for transfer of data can be
established between machines. An operation
"establishes a connection over" a network if the connection
does not exist before the operation begins and the
operation causes the connection to exist.
[0027] A processor "accesses" an item of data in
memory by any operation that retrieves or modifies the
item or information within the item, such as by reading
or writing a location in memory that includes the item. A
processor can be "connected for accessing" an item of
data by any combination of connections with local or
remote memory or input/output devices that permits the
processor to access the item.
[0028] A processor or other component of circuitry
"uses" an item of data in performing an operation when
the result of the operation depends on the value of the
item.

[0029] A processor accesses a first item of data
"with" a second item of data if the processor uses the
second item of data in accessing the first, such as by
using the second item to obtain a location of the first
item of data or to obtain information from within the first
item of data.

[0030] To "obtain" or "produce" an item of data is to 
perform any combination of operations that begins with-
out the item of data and that results in the item of data.
To obtain a first item of data "based on" a second item 
of data is to use the second item to obtain the first item. 
[0031] An item of data "indicates" a thing, event, or
characteristic when the item has a value that depends
on the existence or occurrence of the thing, event, or
characteristic can be obtained by operating on the item
of data. An item of data "indicates" another value when
the item's value is equal to or depends on the other
value.

[0032] An operation or event "transfers" an item of
data from a first component to a second if the result of
the operation or event is that an item of data in the
second component is the same as an item of data that was
in the first component prior to the operation or event.
The first component "provides" the data, and the
second component "receives" or "obtains" the data.
[0033] A "natural language" is an identified system
of symbols used for human expression and communication
within a community, such as a country, region, or
locality or an ethnic or occupational group, during a
period of time. Some natural languages have a standard
system that is considered correct, but the term "natural
language" as used herein could apply to a dialect,
vernacular, jargon, cant, argot, or patois, if identified as
distinct due to differences such as pronunciation, grammar,
or vocabulary.

[0034] A "natural language set" is a set of one or 
more natural languages. 

[0035] "Character" means a discrete element that
appears in a written, printed, or phonetically transcribed
form of a natural language. Characters in the present
day English language can thus include not only
alphabetic and numeric elements, but also punctuation
marks, diacritical marks, mathematical and logical
symbols, and other elements used in written, printed, or
phonetically transcribed English. More generally,
characters can include, in addition to alphanumeric
elements, phonetic, ideographic, or pictographic elements.
[0036] A "word" is a string of one or more elements,
each of which is a character or a combination of
characters, where the string is treated as a semantic unit in at
least one natural language. A word "occurs" in each
language in which it is treated as a semantic unit.
[0037] A "lexicon" is used herein to mean a data
structure, program, object, or device that indicates a set
of words that may occur in a natural language set. A
lexicon may be said to "accept" a word it indicates, and
those words may thus be called "acceptable" or may be
referred to as "in" or "occurring in" the lexicon.
[0038] As used herein, an "inflectional lexicon" is a
lexicon that, when accessed with a correctly inflected
input word, provides access to a lemma or normalized
dictionary-entry form of the input word. An inflectional
lexicon typically includes one or more data structures
and a lookup routine for using the input word to access
the data structures and obtain the lemma or an output
indicating the input word is unacceptable.
[0039] A "prefix" is a substring of characters
occurring at the beginning of a word, and a "suffix" is a
substring of characters occurring at the end of a word.
[0040] A suffix "follows" a substring in a word and
the substring "precedes" the suffix if the last character
of the substring immediately precedes the first
character of the suffix.

[0041] A "relationship" between suffixes refers to
the occurrence in a natural language set of a set of
words that are related but that have different suffixes,
which are thus "related suffixes". A "pairwise
relationship" is a relationship between two suffixes. A
relationship between suffixes "occurs" when a natural language
set includes a set of related words, each of which has
one of the suffixes. If a part of speech is also indicated
for each suffix, the relationship only "occurs" if the
related word that has a suffix also has the indicated part
of speech.

EP 1011 056 A1

[0042] Substrings that precede related suffixes in a
set of different words are "equivalent" if the words are all
related because of a relationship between the
substrings. For example, it is conventional to make minor
graphical changes in a substring that precedes a suffix
during inflectional changes, such as by adding or
deleting a diacritical mark or otherwise changing a character
to indicate a change in pronunciation or by changing
between a single character and a doubled character.
Substrings that precede suffixes may also be equivalent
because they are phonetic alternatives, because of a
historical relationship through which one developed
from the other or both developed from a common
ancestor, because they are cognates in two different
languages, or because of any of various other
relationships.
[0043] The "frequency of occurrence" of a suffix
relationship in a set of words is the number of different
subsets of words in the set that are related by the suffix
relationship.

[0044] A set of suffixes "relates" a set of words if
each of the words can be obtained from any other in the
set by a process that includes removing one of the
suffixes and adding another of the suffixes. The process
may also include other modifications, such as to a
substring preceding the suffix or to a prefix that precedes
the substring, but if there are no such other
modifications, a prefix that includes the substring "occurs" with
each of the suffixes in the set to form the set of related
words.

[0045] A "clustering" is an operation that groups
items based on similarity, association, or another such
measure. To "cluster" is to perform a clustering.
[0046] A "pairwise similarity" is an item of data indi- 
cating a measure of similarity between two items. 
[0047] A finite state transducer (FST) is a data
processing system having a finite number of states and
transitions (or arcs), each transition originating in a state
and leading to a state, and in which each transition has
associated values on more than one level. As a result,
an FST can respond to an input signal indicating a value
on one of the levels by following a transition with a
matching value on the level and by providing as output
the transition's associated value at another level. A
two-level transducer, for example, can be used to map
between input and output strings, and if the values are
character types, the input and output strings can be
words.
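
The two-level behavior described above can be illustrated with a toy transducer. The sketch below is only an illustration under stated assumptions: arcs are stored as a plain list of hypothetical (source state, input symbol, output string, destination state) tuples, every arc consumes exactly one input character, and an empty output string stands for "emit nothing". It does not reflect any particular FST library or the data structure used in the implementation.

```python
def transduce(arcs, start, finals, word):
    """Return every output string the toy FST produces for `word`.

    arcs: list of (src, in_char, out_string, dst) tuples.
    finals: set of accepting states.
    """
    results = []
    stack = [(start, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in finals:
            results.append(out)
        for src, in_char, out_string, dst in arcs:
            # Follow arcs whose input-level value matches the next character.
            if src == state and pos < len(word) and word[pos] == in_char:
                stack.append((dst, pos + 1, out + out_string))
    return results


# Hypothetical arcs mapping "walk", "walks", and "walked" to "walk".
ARCS = [
    (0, "w", "w", 1), (1, "a", "a", 2), (2, "l", "l", 3), (3, "k", "k", 4),
    (4, "s", "", 5),
    (4, "e", "", 6), (6, "d", "", 5),
]
FINALS = {4, 5}

if __name__ == "__main__":
    for w in ("walk", "walks", "walked"):
        print(w, transduce(ARCS, 0, FINALS, w))
```

Swapping the input and output positions of each arc would give the inverse mapping, at the cost of handling arcs that consume no input, which this sketch deliberately leaves out.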

[0048] A "finite state transducer data structure" or
"FST data structure" is a data structure containing
information sufficient to define the states and transitions of
an FST.

[0049] The term "word list" is used herein in the
generic sense of a data structure that indicates a set of
words. The data structure could, for example, be a finite
state machine (FSM) data structure, an FST data
structure, a list data structure, or any other appropriate type
of data structure.

[0050] A "representative" of a group is an item of
data that is unique to the group so that it can be used to
represent the group. A representative may be one of the
members of a group of items of data or it may be an item
of data obtained in some other way.

B. General Features 

[0051] Figs. 1-3 illustrate general features of the
invention. 

[0052] Fig. 1 is a flow diagram that shows
schematically how word groups can be obtained. In Fig. 1, the
boxes at left represent external input to word grouping,
the boxes in the center represent operations performed
during word grouping, and the boxes at right represent
intermediate and final word grouping results.
[0053] The input in box 10 provides information 
about a natural language set from which the operation 
in box 12 can obtain suffix relation data, illustratively
shown as an intermediate result in box 14. As shown, 
the suffix relation data include, for each suffix relation- 
ship, a relation value; suffix relations A and B illustra- 
tively have relation values a and b, respectively. 
[0054] The suffix relation data in box 14 and a set of
words, word1 through wordM as shown in box 20, can
then be used by the operation in box 22, which clusters
the words using the relation values to obtain group data
indicating groups of the words (word group 1 through
word group N), illustratively shown as a final result in
box 24. As illustrated in box 24, a word group
(illustratively word group 1) can include two or more words that
have suffixes as in one of the relationships and,
preceding the suffixes, equivalent substrings.
[0055] In box 30 in Fig. 2, a technique automatically
obtains suffix relation data indicating a relation value for
each of a set of relationships between suffixes that
occur in a natural language set. Then, in box 32, the
technique automatically clusters a set of words that may
occur in the natural language set. Clustering in box 32
uses the relation values from the suffix relation data,
and obtains group data indicating groups of words,
where a group includes two or more words that have
suffixes as in one of the relationships and, preceding
the suffixes, equivalent substrings.
[0056] Machine 50 in Fig. 3 includes processor 52
connected for receiving information about a natural
language from a source 54 and also connected for
accessing data in memory 56 and for receiving instruction data
60 indicating instructions processor 52 can execute.
[0057] In executing the instructions indicated by
instruction data 60, processor 52 obtains suffix relation
data 62 which include, for each of a set of suffix
relationships, a relation value. Processor 52 then clusters a set
of words indicated by word set data 64 using relation
values from suffix relation data 62 to obtain word group
data 66, indicating groups of words, a group including
two or more words that have suffixes as in one of the
relationships and, preceding the suffixes, equivalent
substrings.
[0058] Fig. 3 illustrates three possible destinations
to which data output circuitry 70 could provide word
group data 66: memory 72, storage medium access
device 74, and network 76. In each case, word group
data 66 could be provided separately or as part of a
body of data that may also include instructions and
other data that would be accessed by a processor in
executing the instructions.

[0059] Memory 72 could be any conventional
memory within machine 50, including random access
memory (RAM) or read-only memory (ROM), or could be a
peripheral or remote memory device of any kind.
[0060] Storage medium access device 74 could be
a drive or other appropriate device or circuitry for
accessing storage medium 80, which could, for
example, be a magnetic medium such as a set of one or more
tapes, diskettes, or floppy disks; an optical medium
such as a set of one or more CD-ROMs; or any other
appropriate medium for storing data. Storage medium
80 could be a part of machine 50, a part of a server or
other peripheral or remote memory device, or a
software product. In each of these cases, storage medium
80 is an article of manufacture that can be used in a
machine.

[0061] Network 76 can provide word group data 66
to machine 90. Processor 52 in machine 50 can
establish a connection with processor 92 over network 76
through data output circuitry 70 and network connection
circuitry 94. Either processor could initiate the
connection, and the connection could be established by any
appropriate protocol. Then processor 52 can access
word group data 66 stored in memory 56 and transfer
the word group data 66 to processor 92 over network
76. Processor 92 can store word group data 66 in
memory 94 or elsewhere, and can then access it to perform
lookup.

C. Implementation 

[0062] The general features described above could
be implemented in numerous ways on various
machines to obtain word groups. An implementation
described below has been implemented on a Sun
SPARC workstation running Sun OS and executing
code compiled from C and Perl source code.

C.1. Overview 

[0063] In Fig. 4, system 120 includes the central
processing unit (CPU) 122 of a Sun SPARC workstation,
which is connected to display 124 for presenting
images and to keyboard 126 and mouse 128 for providing
signals from a user. CPU 122 is also connected so
that it can access memory 130, which can illustratively
include program memory 132 and data memory 134.
[0064] The routines stored in program memory 132
can be grouped into several functions: suffix pair
extraction routines 140, relational family construction routines
142, FST conversion routines 144, and lookup routines
146. Fig. 4 also shows several data structures stored in
data memory 134 and accessed by CPU 122 during
execution of routines in program memory
132: inflectional lexicon 150; word list 152; list 154,
listing suffix pairs with frequencies; list 156, listing
relational families of words; FST data structure 158; and
miscellaneous data structures 160. Inflectional lexicon
150 can be any appropriate lexicon for the language of
the words in word list 152, such as Xerox lexicons for
English or for French, both available from InXight
Corporation, Palo Alto, California. Word list 152 can be any
appropriate word list, such as the word lists that can be
extracted from the Xerox lexicons for English or for
French.

[0065] Fig. 5 illustrates high-level acts performed by
processor 122 in executing some of the routines stored
in program memory 132.

[0066] In executing suffix pair extraction routines
140, processor 122 uses inflectional lexicon 150 to
automatically obtain word list 152, as shown in box 200.
Then processor 122 uses word list 152 to automatically
obtain list 154, which lists a set of suffix pairs and their
frequencies, as shown in box 202.
[0067] The acts in boxes 200 and 202 are thus one
way of implementing the act in box 30 in Fig. 2, and the
act in box 200 is optional because a word list could be
obtained in other ways. The suffix pairs can be viewed
as strings for making a transition between pairs of
related words in the word list, with one suffix being a
string that is removed from one word in the pair to obtain
a prefix and the other suffix being a string that is then
added to the prefix, after any appropriate modifications
in the prefix, to obtain the other word in the pair.
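
The act in box 202 can be sketched, in a much simplified form, as counting how often each suffix pair links two words of a word list. The sketch below is an assumption-laden illustration rather than the implementation of Fig. 6: it requires the shared prefix to be identical (no modifications in the prefix), ignores parts of speech, compares all word pairs (quadratic, so only suitable for small lists), and uses a hypothetical `min_prefix` threshold to discard accidental matches.

```python
from collections import Counter
from itertools import combinations
from os.path import commonprefix  # character-wise longest common prefix


def count_suffix_pairs(words, min_prefix=3):
    """Count suffix pairs that link word pairs sharing a long enough prefix."""
    counts = Counter()
    for w1, w2 in combinations(sorted(set(words)), 2):
        p = len(commonprefix([w1, w2]))
        if p < min_prefix:
            continue  # shared prefix too short to suggest a relation
        # Sort the two suffixes so both orders are counted as one pair.
        pair = tuple(sorted((w1[p:], w2[p:])))
        counts[pair] += 1
    return counts


if __name__ == "__main__":
    word_list = ["deplore", "deploring", "deplorable", "explore", "exploring"]
    for pair, n in count_suffix_pairs(word_list).most_common():
        print(pair, n)
```

In this toy list the pair ("e", "ing") links both deplore/deploring and explore/exploring, so it receives the highest count.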
[0068] In executing relational family construction
routines 142, processor 122 then automatically clusters
a set of words from word list 152 using the suffix pair
frequencies from list 154, as shown in box 204. The set of
words can, for example, share a prefix. The result is list
156, which lists a set of word families, each of which is
a subset of the words clustered in box 204. As
suggested by the dashed line in Fig. 5, the act in box 204
can be repeated for each of a number of sets of words;
each set can, for example, include words that share a
specific substring. The act in box 204 is thus one way of
implementing the act in box 32 in Fig. 2. The word
families obtained in this manner are referred to as
"relational families" to distinguish them from conventional
derivational families, which they resemble.
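
Complete link clustering, named in the summary as one option, can be sketched as follows. This is a generic agglomerative version written for illustration, not the routine of box 204: under complete link, the similarity of two clusters is the smallest pairwise similarity between their members, and the closest clusters are merged until no pair reaches a threshold. The `threshold` parameter, the stopping rule, and the function names are assumptions.

```python
def complete_link_clusters(words, similarity, threshold=1):
    """Agglomerative complete-link clustering of `words`.

    similarity(a, b) returns a number; larger means more similar.
    Merging stops when the best cluster pair falls below `threshold`.
    """
    clusters = [[w] for w in words]

    def link(a, b):
        # Complete link: clusters are as similar as their least similar pair.
        return min(similarity(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best, best_i, best_j = None, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if best is None or s > best:
                    best, best_i, best_j = s, i, j
        if best < threshold:
            break
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters
```

Because every member of a merged family must be sufficiently similar to every other member, a single strong suffix pair cannot chain unrelated words into one family, which is the usual reason to prefer complete link over single link.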
[0069] In executing FST conversion routines 144,
processor 122 obtains FST data structure 158 that
provides an input-output pairing between each of the words
in a family and a representative of the family, as shown
in box 206. A family's representative could, for example,
be its shortest word. The act in box 206 is optional, and,
like the acts in boxes 200, 202, and 204, can be
performed automatically.
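
The input-output pairing of box 206 can be pictured with ordinary dictionaries standing in for the FST. The sketch below is only an illustration: it picks the shortest word of each family as representative, breaks ties alphabetically (an assumption, since no tie rule is stated), and builds hypothetical lookup tables in both directions.

```python
def build_lookup(families):
    """Map each word to its family representative, and back.

    Returns (to_rep, to_family): word -> representative, and
    representative -> sorted list of the family's words.
    """
    to_rep, to_family = {}, {}
    for family in families:
        # Shortest word wins; ties are broken alphabetically (assumption).
        rep = min(family, key=lambda w: (len(w), w))
        to_family[rep] = sorted(family)
        for word in family:
            to_rep[word] = rep
    return to_rep, to_family


if __name__ == "__main__":
    to_rep, to_family = build_lookup([["deploring", "deplore", "deplorable"]])
    print(to_rep["deploring"], to_family[to_rep["deploring"]])
```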

[0070] In executing lookup routines 146, processor
122 can provide input words to FST data structure 158
to obtain desired output such as a word family
representative or a list of the words in a family. The input
words can be received from keyboard 126 or can be
indicated by a selection from mouse 128, and the output
can be presented on display 124 as shown. For
example, FST data structure 158 can respond to an input
word by providing as output the representative of a
relational family from box 204 that includes the input word.
Conversely, FST data structure 158 can respond to the
representative of a relational family by providing as
output all the words in the relational family. The FST thus
facilitates rapid lookup of a representative of each
relational family or of the words of a relational family.
[0071] The implementation in Figs. 4 and 5 is thus
based on two intuitive premises: First, it should be
possible to automatically extract a set of suffixes from a
language's inflectional lexicon. Second, it should be
possible to automatically obtain information from a
language's inflectional lexicon about relationships between
suffixes that correspond to relationships between
families of words.

C.2. Suffix Pair Extraction 

[0072] Fig. 6 illustrates in more detail how the acts
in boxes 200 and 202 in Fig. 5 can be implemented.
[0073] The intuitive premise behind suffix extraction
as in Fig. 6 is that long words of a given language tend
to be obtained through derivation, and more precisely
through addition of suffixes, and thus long words can be
used to identify regular suffixes.
[0074] It is helpful to think of two words w1 and w2
of a given language as being p-similar if and only if two
conditions are met: the first p characters of both w1
and w2 are the same, and the (p+1)th characters of w1
and w2 are not the same. The character strings s1 and
s2 that begin with the (p+1)th characters of w1 and w2
respectively can be referred to as pseudo-suffixes, and
either or both s1 and s2 can be the empty string; both
can be empty if w1 and w2 differ only in their part of
speech. The pair (s1, s2) can be referred to as a
pseudo-suffix pair that links w1 and w2.
[0075] The notion of p-similarity can be understood from
examples: The English words "deplorable" and "deploringly" are
6-similar, with (able, ingly) an English pseudo-suffix pair that
links them. Since "deplorable" is an adjective and "deploringly"
is an adverb, the pseudo-suffix pair can be more precisely
written (able+AJ, ingly+AV), where +AJ stands for adjective and
+AV for adverb; this means that a transition from an adjective
to an adverb can be made by removing the string "able" and
adding the string "ingly", or vice versa.
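
The definition of p-similarity can be illustrated with a short
Python sketch. It is only an illustration under stated
assumptions: the function name pseudo_suffix_pair is hypothetical,
words are plain strings, and characters are compared for identity
(no normalization and no part-of-speech tags).

```python
def pseudo_suffix_pair(w1, w2):
    """Return (p, s1, s2) for two words.

    p is the length of the longest common prefix, so the words are
    p-similar; s1 and s2 are the pseudo-suffixes that follow it.
    Either pseudo-suffix may be the empty string.
    """
    p = 0
    while p < len(w1) and p < len(w2) and w1[p] == w2[p]:
        p += 1
    return p, w1[p:], w2[p:]


if __name__ == "__main__":
    print(pseudo_suffix_pair("deplorable", "deploringly"))
    print(pseudo_suffix_pair("divide", "differ"))
```

For "deplorable" and "deploringly" the common prefix is "deplor",
so the sketch reports the pseudo-suffix pair (able, ingly).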
[0076] The notion of p-similarity can be generalized to the
broader notion of p/q-equivalence, where two words w1 and w2 of
a given language set are p/q-equivalent if the following
conditions are met: the first p characters of w1 and the first q
characters of w2 are equivalent, and the (p+1)th character of w1
is not equivalent to the (q+1)th character of w2. Under this
definition, p-similarity is a special case of p/q-equivalence in
which p=q and the first p characters of w1 are identical to the
first p characters of w2. Where w1 and w2 are p/q-equivalent,
the character strings s1 and s2 that begin with the (p+1)th
character of w1 and the (q+1)th character of w2 respectively can
be referred to as pseudo-suffixes, as described above in
relation to p-similarity.
[0077] Ideally, the implementation in Fig. 6 would return only
those pseudo-suffix pairs that are valid, where a pair is valid
if and only if it includes two actual suffixes of the language
that describe the transition between two words in a derivational
family of words in the language. In practice, however, an
automatic technique can only approximate this ideal. The
implementation in Fig. 6 uses p/q-equivalence and the number of
occurrences of a pseudo-suffix pair to reduce the number of
invalid pairs it returns.
[0078] As p/q-equivalence (or p-similarity) of two words
increases, the probability increases that their pseudo-suffix
pair is valid. For example, the 2-similar English words "divide"
and "differ" have a pseudo-suffix pair (vide+V, ffer+V), where
+V stands for verb. But this pair is invalid, because neither
"vide" nor "ffer" is an actual suffix of English, and also
because "divide" and "differ" do not belong to the same
derivational family. On the other hand, any pair of 10-similar
English words is very likely to have a valid pseudo-suffix pair
because the pseudo-suffixes are likely to be actual suffixes and
the words are likely to belong to the same derivational family.
But accepting only pseudo-suffix pairs from 10-similar words
will eliminate many valid pseudo-suffix pairs that only occur
with shorter prefixes.
[0079] The implementation of Fig. 6 accepts a pseudo-suffix pair
only if the pair can be obtained from a pair of words that are
at least 5/5-equivalent. This level of p/q-equivalence
represents a good tradeoff between the increased likelihood that
a pair will be valid if obtained from words that have high
p/q-equivalence and the risk of screening out valid pairs
obtained from words with lower p/q-equivalence. Experience
suggests that a slight change in p/q-equivalence (or
p-similarity) will not significantly change the resulting set of
pseudo-suffix pairs.

[0080] Similarly, as the number of occurrences of a
pseudo-suffix pair increases, the probability increases that the
pseudo-suffix pair is valid. A pair that occurs only once or a
few times may relate to a [...] phenomenon or be invalid. The
implementation of Fig. 6 accepts a pseudo-suffix pair only if
the pair occurs with at least two different prefix sets, where
the prefixes in each set are p/q-equivalent. This minimal value
is sufficient to screen out a large number of invalid pairs, but
remains quite loose so that the same implementation can be
applied to a variety of different languages.
[0081] This parameter can be better understood by considering
exemplary French suffix pairs extracted from a French
inflectional lexicon, each with its number of occurrences:

ation+N:er+V/782
+AJ:ment+AV/460
eur+AJ:ion+N/380
er+V:on+N/50
sation+N:tarisme+N/5

All of these pairs are valid except the last, which occurs, for
example, in "autorisation - autoritarisme" (authorisation -
authoritarianism). These pairs also show that a valid
pseudo-suffix pair does not always link words in the same
derivational family; for example, the pair (er+V, on+N) yields a
link between "saler" and "salon" (in English, salt and lounge),
even though the two words refer to different concepts. Validity
only requires that a pseudo-suffix pair relates two words that
are in the same derivational family, a criterion which is met by
the pair (er+V, on+N) because it relates, for example, "friser"
and "frison" (in English, curl (+V) and curl (+N)).
[0082] As shown in Fig. 6, the implementation begins in box 250
by obtaining an inflectional lexicon FST data structure, such as
the Xerox inflectional lexicons for English and French available
from InXight Corporation of Palo Alto, California. These
lexicons conclude each acceptable character string with a
part-of-speech (POS) tag. The act in box 250 can include
obtaining a handle to access a lexicon that is already stored in
memory, as illustrated by inflectional lexicon 150 in Fig. 4.

[0083] The act in box 252 extracts the non-inflected or lemma
side of the FST data structure to obtain an unfiltered FSM data
structure that accepts all character+POS strings that are
acceptable to the lemma side of the FST data structure. Because
the FST data structure typically accepts an infinite set of
character+POS strings, including strings of numbers, the
unfiltered FSM data structure typically also accepts an infinite
set, although it could be finite if the FST only accepts a
finite set of character+POS strings. The act in box 252 can be
implemented with a conventional automatic FSM extraction
utility. Ways to implement this and related operations can be
understood from US-A-5,625,554 and Kaplan, R.M., and Kay, M.,
"Regular Models of Phonological Rule Systems", Computational
Linguistics, Vol. 20, No. 3, 1994, pp. 331-380 ("the Kaplan and
Kay article").

[0084] The act in box 254 then filters the unfiltered FSM data
structure from box 252 to produce a filtered FSM data structure
that accepts only suitable character+POS strings, thus producing
word list 152 in Fig. 4. For example, character strings can be
filtered out that end with inappropriate POS tags, such as POS
tags that indicate numbers of various kinds or that indicate
other types of character strings that are inappropriate. In
languages in which words can be created by concatenation, such
as German, character strings may be filtered out that exceed a
maximum appropriate length or a maximum appropriate number of
concatenated parts, such as four. The act in box 254 can be
implemented by composing the unfiltered FSM with a filtering FSM
that defines the conditions for including a character string in
the filtered FSM. The filtering FSM can be created using
conventional techniques, similar to those described in
US-A-5,625,554 and the Kaplan and Kay article.
[0085] The act in box 256 then uses the filtered FSM data
structure from box 254 to create a prefixing FST data structure
that outputs a normalized prefix-equivalent in response to an
acceptable word. In other words, the input level of the
prefixing FST accepts all of the character+POS strings accepted
by the filtered FSM, and the output level provides, for each
input character+POS string, a normalized p-character string that
is p/q-equivalent to a prefix of the input character+POS string
for some value of q.

[0086] The act in box 256 can be implemented by creating an
intermediate FST that responds to a character string by
automatically normalizing characters until a string of p
normalized characters is obtained; normalizing can include, for
example, removing diacritical marks and possibly making
replacements of characters and other modifications that produce
equivalent prefixes. In the current implementation, each
normalizing operation replaces a character with another
character, thus maintaining one-to-one or p/p-equivalence
between characters; the implementation could readily be extended
to normalize a doubled character as a single character or vice
versa or to make other normalizations that change the number of
characters.
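
As a rough illustration of one-to-one normalization, the Python
sketch below strips diacritical marks character by character and
keeps the first p normalized characters. It is an assumption-laden
stand-in for the intermediate FST, not the implementation itself:
the function names are hypothetical, and Unicode canonical
decomposition is used here as one convenient way to remove
diacritics.

```python
import unicodedata


def normalize_char(ch):
    """Map one character to exactly one normalized character.

    The character is decomposed (NFD) and only the base character
    is kept, which drops combining diacritical marks.
    """
    return unicodedata.normalize("NFD", ch)[0]


def prefix_equivalent(word, p=3):
    """Return the normalized string for the first p characters."""
    return "".join(normalize_char(ch) for ch in word[:p])


if __name__ == "__main__":
    # "élève" and "eleve" share the same 3-character prefix-equivalent.
    print(prefix_equivalent("élève"), prefix_equivalent("eleve"))
```

Because each input character yields exactly one output character,
the sketch preserves the p/p-equivalence described above; handling
doubled characters would require a different mapping.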

[0087] The intermediate FST can then be run on the word list
indicated by the filtered FSM to produce a prefix FSM that
indicates all the prefix-equivalents. The prefix FSM can be
composed with the filtered FSM to produce the prefixing FST. In
the prefixing FST, the (p+1)th and following characters along
each path at the input level are paired with an epsilon at the
output level, meaning that no output is provided in response to
those characters.

[0088] The act in box 260 inverts the prefixing FST from box 256
to produce an inverted prefixing FST with prefix-equivalents at
the input level and with acceptable words at the output level.
The act in box 260 can be implemented by simply switching the
elements in the label of each transition in the FST.
[0089] The act in box 262 composes the prefixing FST from box
256 with the inverted prefixing FST from box 260, performing the
composition so that the prefix-equivalents drop out and so that
epsilons are added to lengthen the shorter of an input
character+POS string to equal an output character+POS string,
producing a prefix-group FST. The prefix-group FST accepts the
word list of the filtered FSM from box 254 at its input level
and can output every word that had the same prefix-equivalent in
the prefixing FST from box 256. The composition in box 262 can
be performed with conventional techniques similar to those
described in US-A-5,625,554 and the Kaplan and Kay article.
[0090] The act in box 264 then uses the prefix-group FST from
box 262 to produce a list of prefix+suffix alternatives. The act
in box 264 can be implemented by reading out the character pairs
along each path of the prefix-group FST and comparing the
characters in each pair until a mismatch is found. Before
finding a mismatch, only one character is saved for the prefix,
but after a mismatch is found, both characters of each pair are
saved for the suffix.

[0091] For example, for the path that relates the verb "produce"
to the noun "production", the prefix+suffix alternatives could
be represented by the following string: (p, r, o, d, u, c,
<e:t>, <+V:i>, <ε:o>, <ε:n>, <ε:+N>), where "ε" characters have
been included to compensate for the difference in length, and
where "+V" and "+N" indicate a verb and a noun, respectively.
The prefix of the path is thus "produc", and the path has two
alternative suffixes, "e+V" and "tion+N".
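
The readout of a path can be sketched in Python as follows. This
is a minimal illustration that assumes each path is given as two
token lists (characters followed by a POS tag) rather than as FST
transitions; the names read_path and EPSILON are hypothetical, and
padding the shorter list at its end is an assumption that happens
to reproduce the "produce"/"production" example.

```python
from itertools import zip_longest

EPSILON = "ε"  # placeholder symbol padding the shorter string


def read_path(tokens1, tokens2):
    """Split two aligned token lists into (prefix, suffix1, suffix2).

    Before the first mismatch one token per pair is saved for the
    prefix; from the first mismatch onward both tokens of each pair
    are saved for the suffixes, skipping epsilon padding.
    """
    prefix, suffix1, suffix2 = [], [], []
    mismatch_found = False
    for t1, t2 in zip_longest(tokens1, tokens2, fillvalue=EPSILON):
        if not mismatch_found and t1 == t2:
            prefix.append(t1)
            continue
        mismatch_found = True
        if t1 != EPSILON:
            suffix1.append(t1)
        if t2 != EPSILON:
            suffix2.append(t2)
    return "".join(prefix), "".join(suffix1), "".join(suffix2)


if __name__ == "__main__":
    verb = list("produce") + ["+V"]
    noun = list("production") + ["+N"]
    print(read_path(verb, noun))
```

Run on the two token lists above, the sketch separates the prefix
"produc" from the alternative suffixes "e+V" and "tion+N".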
[0092] The list produced in box 264 thus implicitly detects the
p-similarity of each pair of related words. In addition, the
list can be produced in such a way that it is alphabetized, so
that larger groups of p-similar words are adjacent within the
list. As a result, the list can readily be used to produce
pseudo-families of p-similar words, which can then be used as
described below in relation to Fig. 7.

[0093] The act in box 266 can then use the list produced in box
264 to produce list 154 of suffix pairs and frequencies. List
154 can be produced by converting the suffix alternatives of
each item on the list from box 264 to a suffix pair. If the
suffix pair matches a suffix pair on a list of previously
obtained suffix pairs, the act in box 266 increments its
frequency; if not, the suffix pair is added to the list with a
frequency of one. Because the automatically obtained suffix
pairs in list 154 may not all include valid suffixes of a
natural language set, they are referred to below as
"pseudo-suffix pairs".
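
The counting of pseudo-suffix pairs can be sketched in Python
with plain lists in place of the FST machinery. The sketch is
illustrative only: the function names, the (word, POS) tuple
format, and the parameters min_prefix and min_count are
assumptions, words are grouped by their first five characters to
approximate the 5/5-equivalence requirement (without
normalization), and a pair is kept only if it occurs with at least
two different prefixes.

```python
from collections import defaultdict
from itertools import combinations


def split_at_mismatch(w1, w2):
    """Return (common prefix, rest of w1, rest of w2)."""
    p = 0
    while p < len(w1) and p < len(w2) and w1[p] == w2[p]:
        p += 1
    return w1[:p], w1[p:], w2[p:]


def suffix_pair_frequencies(tagged_words, min_prefix=5, min_count=2):
    """Count pseudo-suffix pairs over a list of (word, pos) tuples.

    Returns a dict mapping an unordered pair such as
    ("+V", "ment+N") to the number of distinct prefixes it links.
    """
    groups = defaultdict(list)
    for word, pos in tagged_words:
        if len(word) >= min_prefix:
            groups[word[:min_prefix]].append((word, pos))

    prefixes_seen = defaultdict(set)
    for group in groups.values():
        for (w1, pos1), (w2, pos2) in combinations(sorted(group), 2):
            prefix, s1, s2 = split_at_mismatch(w1, w2)
            pair = tuple(sorted((f"{s1}+{pos1}", f"{s2}+{pos2}")))
            prefixes_seen[pair].add(prefix)

    return {pair: len(prefixes)
            for pair, prefixes in prefixes_seen.items()
            if len(prefixes) >= min_count}


if __name__ == "__main__":
    words = [("deploy", "V"), ("deployable", "AJ"), ("deployment", "N"),
             ("employ", "V"), ("employable", "AJ"), ("employment", "N"),
             ("depart", "V"), ("department", "N")]
    for pair, count in sorted(suffix_pair_frequencies(words).items()):
        print(pair, count)
```

Note that the pair (+V, ment+N) is counted for "depart" and
"department" as well, which is the kind of spurious link that the
clustering described below is meant to handle.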

C.3. Word Family Construction 

[0094] Fig. 7 illustrates in more detail how the act in box 204
in Fig. 5 can be implemented.
[0095] Word family construction as in Fig. 7 addresses the
problem of automatically grouping all the words together that
belong to the same derivational family without grouping words
together that belong to different derivational families. Simple
approaches that have been proposed typically do not overcome
this problem.

[0096] One simple approach is to add words to a family that have
some level of p-similarity with words in the family and that
relate to words in the family through suffix pairs. For example,
given the two English suffix pairs (+V, able+AJ) and (+V,
ment+N), we can first group the 6-similar words "deploy" and
"deployable", and then add to this family the word "deployment".
But this approach will also group "depart" and "department" into
the same family.

[0097] In general, then, suffix pairs can relate words that do
not belong to the same derivational family. This problem is
general to any stemming procedure, as illustrated by the English
string "ment" at the end of a word, which may or may not
correspond to a suffix. This phenomenon is frequently cited by
opponents of fully automatic stemming procedures, and it is
typically overcome with a list of exceptions, including
exceptions to control removal of the suffix "ment".
[0098] The intuitive premise behind word family construction as
in Fig. 7 is that suffixes form families, and that the use of a
suffix usually coincides with the use of other suffixes in the
same family, while suffixes from different families do not
co-occur. For example, if the string "ment" is not a suffix, as
in "department", then it is likely that the word obtained after
removal of the string, i.e. "depart", will support suffixes that
do not usually co-occur with "ment", such as "ure", which
produces "departure". This premise is supported by the manually
created suffix families disclosed in the Debili thesis,
discussed above.

[0099] To automatically construct relational families of words
based on families of suffixes, the implementation of Fig. 7 uses
a hierarchical agglomerative clustering approach, allowing an
element to be added to a cluster in accordance with a similarity
measure. Specifically, Fig. 7 illustrates a complete link
approach of the type disclosed by Rasmussen, E., "Clustering
Algorithms", in Frakes, W.B., and Baeza-Yates, R., eds.,
Information Retrieval: Data Structures and Algorithms, Prentice
Hall, Englewood Cliffs, New Jersey, 1992, pp. 419-436. The
complete link approach takes into account the relation between
an element and all the elements that are already in the cluster
to which it may be added. The similarity measure used is a
measure of suffix similarity which is initially based on a
frequency from box 202.

[0100] Suffix similarity alone, however, is not sufficient to
relate two words. For example, the suffix pair (able+AJ,
ingly+AV) would, without more, relate the words "enjoyable" and
"deploringly", which are unrelated. Therefore, the
implementation of Fig. 7 also uses prefix criteria to determine
whether words should be grouped together.

[0101] A first prefix criterion is applied to obtain
pseudo-families of words that have an appropriately chosen level
of similarity or equivalency. It has been found that
3-similarity or 3/q-equivalency are appropriate criteria to
obtain pseudo-families of words for languages such as English
and French, and appear to preserve valid relations between words
while being language independent.

[0102] An extract of an exemplary 3-similar English
pseudo-family includes the following words: deployability,
deployable, deploy, deployer, department, departer, depart,
departmental, deprecate, deprecation, deprive, depriver,
deplore, deplorable, deploringly.
[0103] A second prefix criterion is applied implicitly 
in finding pairs of words that are related by suffix pairs. 
In other words, a pair of words are only related by a suf- 
fix pair if removing one of the suffixes from one of the 
words, obtaining one or more equivalents of the remaining
prefix, and replacing the removed suffix with the
other suffix on one of the equivalents produces the 
other word, which can only be true if the prefix that pre- 
cedes the suffix in one word is equivalent to the prefix 
that precedes the suffix in the other word. 
[0104] The implementation of Fig. 7 begins in box 300 when a
call is received to perform clustering on a set of words such as
word list 152 using a set of pseudo-suffix pairs and their
frequencies such as list 154, which can be obtained as in Fig.
6. Before proceeding further, the implementation could obtain
all of the 3-similar or 3/q-equivalent pseudo-families in the
set of words, or each pseudo-family can be obtained as it is
needed.

[0105] The act in box 302 begins an outer iterative 
loop that handles each of the pseudo-families in turn. 
The act in box 304 obtains the next pseudo-family, such 
as by accessing a previously obtained pseudo-family or 
by obtaining one anew. 

[0106] Then the act in box 310 begins a first inner iterative
loop that handles each word pair that can be obtained from the
pseudo-family. In box 312, each iteration of the loop sets the
suffix similarity of a word pair. The suffix similarity is the
highest frequency of the pseudo-suffix pairs received in box 300
that relate the two words, whether with identical prefixes or
acceptable equivalent prefixes. If none of the pseudo-suffix
pairs relate the two words, the suffix similarity is zero.
[0107] The act in box 320 begins a second inner iterative loop
that handles each word pair with a suffix similarity greater
than zero, continuing until the test in box 320 determines that
none of the word pairs have a suffix similarity greater than
zero. The act in box 322 begins each iteration by finding the
pair of words that are related by a suffix pair with the
greatest suffix similarity. The iteration then branches in box
330 based on previous clustering, if any.

[0108] If both words of the pair have already been 
clustered, the two clusters that include them are 
merged, in box 332. If one of the words has been clustered
and the other has not, the non-clustered word is
added to the cluster that includes the other word, in box 
334. And if neither of the words has been clustered, a 
new cluster is created that includes both of them, in box 
336. 

[0109] Each iteration also includes adjustment of suffix
similarities of other words with respect to the words of the
pair found in box 322. As shown in box 340, this is done for
every word. In box 342, the greater of a word's suffix
similarities with the words of the pair is reduced to zero. This
ensures that the word will only be added to the cluster that
includes the words of the pair if the lesser suffix similarity
is relatively large, thus enforcing the complete link approach.
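
The loop of boxes 320 through 342 can be sketched in Python as
follows. This is a simplified reading of the description rather
than the implementation itself: the function name, the use of
frozenset keys, the alphabetical tie-breaking, and the removal of
the selected pair's own similarity (needed so the loop terminates)
are assumptions, and the frequencies in the usage example are
invented for illustration.

```python
def complete_link_families(words, similarity):
    """Cluster one pseudo-family.

    words: list of words in the pseudo-family.
    similarity: dict mapping frozenset({w1, w2}) to a suffix
        similarity; missing pairs count as zero.
    Returns a list of sorted word lists, one per relational family.
    """
    sim = {pair: s for pair, s in similarity.items() if s > 0}
    cluster_of = {}  # word -> the set object holding its cluster

    while sim:  # box 320: some pair still has similarity > 0
        # Box 322: pair with the greatest suffix similarity.
        best = max(sim, key=lambda pr: (sim[pr], tuple(sorted(pr))))
        a, b = sorted(best)
        del sim[best]  # assumption: a handled pair is not revisited

        if a in cluster_of and b in cluster_of:      # box 332: merge
            if cluster_of[a] is not cluster_of[b]:
                merged = cluster_of[a] | cluster_of[b]
                for w in merged:
                    cluster_of[w] = merged
        elif a in cluster_of:                        # box 334: add b
            cluster_of[a].add(b)
            cluster_of[b] = cluster_of[a]
        elif b in cluster_of:                        # box 334: add a
            cluster_of[b].add(a)
            cluster_of[a] = cluster_of[b]
        else:                                        # box 336: new
            cluster_of[a] = cluster_of[b] = {a, b}

        # Boxes 340-342: for every other word, zero the greater of
        # its similarities with the two words of the pair.
        for w in words:
            if w == a or w == b:
                continue
            key_a, key_b = frozenset((w, a)), frozenset((w, b))
            greater = key_a if sim.get(key_a, 0) >= sim.get(key_b, 0) else key_b
            sim.pop(greater, None)

    families, seen = [], set()
    for w in words:  # box 350: clusters plus one-word families
        if w not in seen:
            family = cluster_of.get(w, {w})
            families.append(sorted(family))
            seen.update(family)
    return families


if __name__ == "__main__":
    words = ["depart", "departure", "department", "departmental"]
    similarity = {  # invented frequencies, for illustration only
        frozenset(("depart", "department")): 50,
        frozenset(("depart", "departure")): 40,
        frozenset(("department", "departmental")): 60,
    }
    print(complete_link_families(words, similarity))
```

With these invented values, "department" first joins
"departmental"; the link between "depart" and "department" is then
zeroed because "depart" has no link to "departmental", so "depart"
ends up in a separate family with "departure".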
[0110] When all the suffix similarities reach zero for one of
the pseudo-families, the act in box 350 saves each cluster as a
relational family, and also saves each unclustered word in a
respective one-word relational family. The following, separated
by lines of asterisks, are exemplary clusters obtained from the
English pseudo-family of 3-similar words beginning with "dep":

*****
depletability depletable depletableness depletably
deplete depleter depletion
*****
deployability deployable deployableness deployably
deploy deployer deployment
*****
depressant depress depresser depressingly
depression depressor depressive depressiveness
depressively
*****
deprecate deprecation deprecator deprecative
deprecativeness deprecatively deprecativity
deprecatorily deprecatory deprecatingly
*****
deposability deposable deposableness deposably
depose deposer deposal
*****
department departmentally departmental
departmentalness departmental
*****
depart departure departer

[0111] When relational families have been obtained for all the
pseudo-families, the act in box 352 returns all the relational
families from all the pseudo-families, such as in the form of
list 156 in Fig. 4.

C.4. Conversion to FST

[0112] In the current implementation, a representative word can
be automatically chosen for each relational family returned in
box 352. The representative word can be the shortest word in the
family or, if more than one word has the shortest length, an
arbitrarily chosen word with the shortest length, such as the
verb, or a randomly chosen verb if more than one verb has the
shortest length. Or the representative could be the first word
of the shortest length words in alphabetical order.
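
One way to express this choice is the Python sketch below. It is
only an illustration: the function name and the optional verbs
argument are hypothetical, and ties are broken alphabetically
rather than randomly so that the result is reproducible.

```python
def choose_representative(family, verbs=frozenset()):
    """Pick a representative word for a relational family.

    The shortest words are candidates; a candidate listed in
    `verbs` is preferred, and remaining ties are broken by
    alphabetical order (an assumption made for determinism).
    """
    shortest = min(len(word) for word in family)
    candidates = sorted(w for w in family if len(w) == shortest)
    verb_candidates = [w for w in candidates if w in verbs]
    return (verb_candidates or candidates)[0]


if __name__ == "__main__":
    family = ["deployment", "deployer", "deploy", "deployable"]
    print(choose_representative(family, verbs={"deploy"}))
```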

[0113] All the relational families of a language can then be
automatically converted into a finite state transducer (FST) of
the type disclosed in US-A-5,625,554. Each transition of the FST
can have two levels, a first of which indicates a character of a
word and a second of which indicates a character of the
representative of the family that includes the word. Such an FST
can be produced using techniques similar to those disclosed in
Karttunen, L., "Constructing Lexical Transducers", Proceedings
of the International Conference on Computational Linguistics,
Coling 94, August 5-9, 1994, Kyoto, Japan, Vol. 1, pp. 406-411.

C.5. Lookup 

[0114] The FST produced as described above can be used in
various ways, some of which are similar to techniques described
in US-A-5,625,554. For example, it can be accessed with the
characters of a word to obtain a representative of the
relational family that includes the word. Or it can be accessed
with the representative of a relational family to obtain the
words in the family. Or both of these techniques can be
performed in sequence to use a word to obtain the words in its
family.
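
The two directions of lookup can be mimicked with ordinary
dictionaries, as in the Python sketch below. Dictionaries are used
here purely as a stand-in for the FST, and the function name and
the shortest-then-alphabetical choice of representative are
assumptions for illustration.

```python
def build_lookup(families):
    """Build word -> representative and representative -> words maps."""
    word_to_rep, rep_to_words = {}, {}
    for family in families:
        # Assumed rule: shortest word, ties broken alphabetically.
        rep = min(family, key=lambda w: (len(w), w))
        rep_to_words[rep] = sorted(family)
        for word in family:
            word_to_rep[word] = rep
    return word_to_rep, rep_to_words


if __name__ == "__main__":
    families = [["deploy", "deployer", "deployment"],
                ["depart", "departure"]]
    word_to_rep, rep_to_words = build_lookup(families)
    rep = word_to_rep["deployment"]        # word -> representative
    print(rep, rep_to_words[rep])          # representative -> family
```

Chaining the two maps, as in the last lines, corresponds to using
a word to obtain the words in its family.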
[0115] The FST can also be used to obtain a representative of a
relational family that is likely to relate to an unknown word,
using the techniques disclosed in copending, coassigned U.S.
Patent Application No. 09/XXX,XXX (Attorney Docket No.
R/98022G), entitled "Identifying a Group of Words Using Modified
Query Words Obtained from Successive Suffix Relationships",
incorporated herein by reference.

C.6. Results 

[0116] An FST for English words produced as described above has
been compared with the derivational lexicon described in Xerox
Corporation, Xerox Linguistic Database Reference (English
Version 1.1.4 ed.), 1994 ("the Xerox Database"), considered to
be a very high quality derivational lexicon. Initially,
derivational families were extracted from the Xerox Database and
compared to the relational families in the FST. The comparison
was based on the number of words which would have to be moved in
or out of a relational family to obtain a counterpart
derivational family. Due to overstemming errors, in which
unrelated words are in the same relational family, and
understemming errors, in which related words are in different
relational families, there are often differences between the
relational and derivational families.

[0117] Counterpart relational and derivational families were
identified by assuming that a word wi is correctly placed in a
relational family ri if the relational family includes most of
the words of the derivational family of wi and the derivational
family of wi includes most of the words of the relational family
ri. In other words, a relational family and derivational family
are counterparts only if they are mostly the same. Words that
are not in both counterparts must be moved to make the
relational and derivational families the same, and the ratio of
words that need not be moved, summed over all the derivational
families, to the total number of words in all the derivational
families is a distance measure between the two sets of families.
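
One possible reading of this measure is sketched in Python below.
The sketch rests on assumptions not stated in the text: "most" is
taken to mean more than half, a derivational family with no
counterpart contributes no correctly placed words, and the
function name is hypothetical.

```python
def family_agreement(relational, derivational):
    """Ratio of words that need not be moved.

    relational, derivational: iterables of word collections. A
    relational family r is the counterpart of a derivational
    family d when their intersection holds more than half of d
    and more than half of r.
    """
    relational = [set(r) for r in relational]
    kept = total = 0
    for d in (set(family) for family in derivational):
        total += len(d)
        for r in relational:
            common = len(d & r)
            if 2 * common > len(d) and 2 * common > len(r):
                kept += common  # words shared by the counterparts
                break
    return kept / total if total else 0.0


if __name__ == "__main__":
    derivational = [{"deploy", "deployment", "deployer"},
                    {"depart", "departure"},
                    {"department", "departmental"}]
    overstemmed = [{"deploy", "deployment"}, {"deployer"},
                   {"depart", "departure",
                    "department", "departmental"}]
    print(family_agreement(derivational, derivational))
    print(family_agreement(overstemmed, derivational))
```

In the second call, the overstemmed family that mixes "depart"
with "department" has no counterpart under the more-than-half
rule, so only the two shared "deploy" words count as correctly
placed.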



[0118] As a preliminary test, relational families were obtained
from the Xerox inflectional lexicon for English using three
different hierarchical agglomerative clustering techniques: the
complete link technique described above; a single link technique
that makes no use of suffix family but rather adds an element to
a cluster whenever a link, as detected in box 322 in Fig. 7,
exists between the element and one of the elements of the
cluster; and a group average technique that makes partial use of
the notion of suffix family by adding an element to a cluster on
the basis of the average link between the element and the
elements of the cluster, where the average link can be obtained
from the suffix similarities obtained in box 312 in Fig. 7. When
the three sets of relational families were compared with the
derivational families from the Xerox Database in the manner
described above, the following ratios were obtained: single link
relational families, 0.47; group average relational families,
0.77; and complete link relational families, 0.835. These
results confirm that the notion of suffix families is effective,
and validate the use of the complete link technique.

[0119] The relational families obtained with the complete link
technique were also compared with two well-known English
stemmers: the SMART stemmer disclosed in Salton, G., and McGill,
M.J., Introduction to Modern Information Retrieval, New York:
McGraw-Hill, 1983, pp. 130-138, and the stemmer described in
Porter, M.F., "An algorithm for suffix stripping", Program, Vol.
14, No. 3, July 1980, pp. 130-137. The SMART stemmer, which
includes a set of rules with conditions of application, was
first implemented twenty-five years ago and has undergone much
manual revision by generations of information retrieval
researchers. Porter's stemmer, currently the most used stemmer,
is compact, with only 60 different suffixes, and is easy to
implement, yet it performs nearly as well as other, more complex
stemmers.

[0120] The comparison with the SMART stemmer and Porter's
stemmer was done by first constructing families with each
stemmer and then comparing the constructed families with the
derivational families from the Xerox Database in the manner
described above. The families were constructed by submitting the
whole lemmatized lexicon of the Xerox Database to each stemmer
and grouping words that share the same stem. The following
ratios were obtained: SMART stemmer, 0.82; Porter's stemmer,
0.65.
[0121] These results indicate that the SMART stemmer performs
better than Porter's stemmer, but that the relational families
from the complete link technique perform at least as well as the
SMART stemmer and better than Porter's stemmer. In other words,
the technique described above can produce relational families
closer to actual derivational families than the families
constructed from the two stemmers. This strength of the
technique apparently results from its ability to distinguish
between groups that may not be distinguished by exceptions in
the SMART stemmer and Porter's stemmer. The example above, in
which the technique can distinguish the relational family of
"depart" from the relational family of "department", is
apparently representative of a number of similar cases that are
not covered by exceptions in the stemmers. Another advantage of
the implementation described above is that the representative of
each relational family is always an acceptable word.

[0122] An FST for French words produced as described above from
the Xerox inflectional lexicon for French has also been tested,
though without the benefit of a French derivational lexicon
comparable to the Xerox Database. The test was made in the
framework of an information retrieval task, an environment in
which differences between stemmers tend to be smoothed. The aim
of the test was to determine whether the FST could increase the
performance of an information retrieval system operating in
French.
[0123] The test used the data, i.e. document set, topic set and
relevance judgment set, provided within the AMARYLLIS project,
described in Coret, A., Kremer, P., Landi, B., Schibler, D.,
Schmitt, L., and Viscogliosi, N., "Accès à l'information
textuelle en français: Le cycle exploratoire Amaryllis", 1ères
JST 1997 FRANCIL de l'AUPELF-UREF, 15-16 avril 1997, Avignon,
France, pp. 5-8, a project that evaluates information retrieval
systems on French data. The test ran three different indexing
schemes: the first with no treatment and considering words as
index; the second replacing each word with its lemma from the
Xerox inflectional lexicon; and the third replacing each word
with the representative of its relational family from the FST.
The average precision increased over the first scheme by
approximately 16.5% with the second scheme and by approximately
18% with the third scheme.

C.7. Variations 

[0124] The implementations described above could be varied in
many ways within the scope of the invention.



[0125] The implementations described above have been
successfully executed on Sun SPARC workstations, but
implementations could be executed on other machines.

[0126] The implementations described above have been
successfully executed using the C programming environment on the
Sun OS platform, but other programming environments and
platforms could be used.
[0127] The implementations described above perform clustering
over relation values that are suffix similarities measured by
frequency of occurrence of pseudo-suffix pairs. The invention
could be implemented to perform clustering over any other
automatically obtained relation value, whether a distance, a
similarity or dissimilarity coefficient or any other appropriate
value, such as mutual information. Further, a relation value
used in automatic clustering could indicate a relation between
more than two pseudo-suffixes.
[0128] The implementations described above include parts of
speech in each pseudo-suffix pair, but this is optional. The
invention could be implemented without taking part of speech
into account.
[0129] The implementations described above use a language's
inflectional lexicon to obtain relational families of words, but
other information about a language could be used to obtain
relational families, such as a list or other data structure
indicating words that are acceptable in the language.

[0130] The implementations described above use various
clustering techniques, including the complete link technique,
the group average technique, and the single link technique, to
produce mutually exclusive clusters. Other clustering techniques
could be used within the scope of the invention, including
double link and other n-link clustering and also including
techniques in which a word can be within more than one cluster.

[0131] The implementations described above perform clustering on
3-similar pseudo-families obtained from a set of words, but
clustering could be performed on the entire word set or on
pseudo-families meeting other criteria.

[0132] The implementations described above yield an FST
converted directly from automatically obtained relational
families, but automatically obtained relational families in
accordance with the invention could also be manually "cleaned"
or modified to obtain higher quality derivational families more
rapidly than would be possible with manual techniques. Further,
converting groups of words obtained in accordance with the
invention into another form is optional, and groups of words
could be stored in many other types of data structures in
addition to an FST, such as in relational databases.
[0133] Also, additional types of information about
automatically obtained relational families of words could be
obtained and used in various ways. For example, for each
relational family, a data structure such as a tree that
summarizes the suffix relations within the family could be
automatically obtained. For the relational family {deploy,
deployment, deployer, deployable, deployability}, the top
level of such a tree could be (Verb), with a link that adds
the suffix "-ment" to obtain a (Noun) child, with a link that
adds the suffix "-er" to obtain another (Noun) child, and
with a link that adds the suffix "-able" to obtain an
(Adjective) child; the (Adjective) child could, in turn, have
a link that adds the suffix "-ity" to change "able" to
"ability" to obtain a (Noun) grandchild. The resulting tree
should approximate the ideal derivational tree joining the
words of a derivational family, thus providing access to
potential suffixes of a language, their morphotactics, and
their paradigmatic use. Frequencies of occurrence of the
trees can also be obtained, and more probable suffix relation
trees can thus be identified.
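
The tree for the deploy family could be represented, for instance, as follows. This Python sketch is only one possible encoding: the `Node` class, its `add` method, and the `expand` helper are hypothetical names, and each link is assumed to be stored as a pair of an old ending and the new ending that replaces it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    pos: str                                    # part of speech
    links: dict = field(default_factory=dict)   # (old ending, new ending) -> child

    def add(self, old, new, pos):
        """Attach and return a child reached by replacing `old` with `new`."""
        child = Node(pos)
        self.links[(old, new)] = child
        return child


def expand(word, node):
    """Yield (word, part of speech) for the node and all its descendants."""
    yield word, node.pos
    for (old, new), child in node.links.items():
        stem = word[: len(word) - len(old)]     # strip the old ending, if any
        yield from expand(stem + new, child)


root = Node("Verb")
root.add("", "ment", "Noun")
root.add("", "er", "Noun")
adjective = root.add("", "able", "Adjective")
adjective.add("able", "ability", "Noun")        # "-ity" changes "able" to "ability"

family = list(expand("deploy", root))
```

Applying the same tree to another verb, for example `expand("employ", root)`, generates the candidate forms that the tree predicts for that verb, which is one way the frequency of a tree across families could be measured.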
[0134] To develop a derivational lexicon from relational
families, a lexicographer could view the results obtained in
this manner, review the automatically identified derivational
processes implicit in the results for validity, make
modifications in accordance with the actual derivational
processes of a language, and generate a resulting set of
modified relational families for further study and possible
use. Modified relational families obtained in this way may
provide a better approximation to derivational families than
automatically obtained relational families. A lexicographer
might therefore be able to develop a derivational lexicon
more quickly without sacrificing accuracy, because the
lexicographer can focus on irregularities of the language
under consideration.

[0135] The implementations described above use an FST
that maps from a word to a representative that is a shortest
length word in the same relational family and from a
representative to a list of words in the family. An FST
could, however, map to any other appropriate representative
of a family, such as an extracted root or even a number or
other value that serves as an index.
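
The two directions of this mapping can be illustrated with plain Python dictionaries standing in for the FST. This is a sketch under stated assumptions: `build_lookup` is a hypothetical name, and breaking ties between equally short words alphabetically is an assumption made here for determinism.

```python
def build_lookup(families):
    """Build two dictionaries standing in for the two directions of the FST.

    Returns (to_rep, to_family): word -> representative, and
    representative -> sorted list of the words in its family.
    """
    to_rep, to_family = {}, {}
    for family in families:
        # Shortest word in the family; ties broken alphabetically (assumption).
        rep = min(family, key=lambda w: (len(w), w))
        to_family[rep] = sorted(family)
        for word in family:
            to_rep[word] = rep
    return to_rep, to_family


to_rep, to_family = build_lookup([
    {"deploy", "deployment", "deployer"},
    {"depart", "departure"},
])
```

Replacing `rep` with an integer index or an extracted root would give the alternative representatives mentioned above without changing the lookup structure.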
[0136] The implementations described above employ
relations between suffixes and do not take relationships
between prefixes into account in grouping words, but
relationships between prefixes could be taken into account
by removing certain prefixes before grouping words and
might be taken into account in other ways.

[0137] The implementations described above have been
applied to English and French, and values of parameters
such as p-similarity and pft) equivalence, although chosen
for generality, have only been successfully used with
English and French. The invention can be applied to
languages other than English and French and values of
parameters could be modified as necessary for greater
generality or for optimal results with any specific language.

[0138] In the implementations described above, acts are
performed in an order that could be modified in many cases.

[0139] The implementations described above use currently
available computing techniques, but could readily be
modified to use newly discovered computing techniques as
they become available.

D. Applications

[0140] The invention can be applied to automatically
produce an approximation of a derivational lexicon. The
result, referred to herein as a "relational lexicon", can then
be used as a derivational lexicon would be used. For
example, from an input word, the relational lexicon can
obtain another word that represents a group or family of
related words, a process sometimes referred to as
"stemming", "normalization", or "lemmatization". Unlike an
inflectional lexicon, the relational lexicon can stem,
normalize, or lemmatize across parts of speech. To improve
information retrieval performance, this can be done in
advance for each word in a database being searched and, at
the time of search, for each word in a query. A search can
then be performed by comparing the representative words,
to find items in the database that relate to the query.
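
The retrieval use can be sketched in Python as follows. This is only an illustration: the `normalize` and `search` functions, the whitespace tokenization, and the rule that one shared representative is enough for a match are all assumptions, and a plain dictionary stands in for the relational lexicon.

```python
def normalize(text, to_rep):
    """Map each word of a text to its family representative.

    Words missing from the lexicon are kept unchanged (an assumption).
    """
    return {to_rep.get(w, w) for w in text.lower().split()}


def search(query, documents, to_rep):
    """Return ids of documents sharing at least one representative with the query."""
    # In practice this index would be built in advance, once per database.
    index = {doc_id: normalize(text, to_rep) for doc_id, text in documents.items()}
    wanted = normalize(query, to_rep)
    return sorted(doc_id for doc_id, reps in index.items() if reps & wanted)


to_rep = {"deploy": "deploy", "deployment": "deploy", "deployable": "deploy"}
documents = {1: "rapid deployment plan", 2: "annual budget"}
hits = search("deployable units", documents, to_rep)
```

The query word "deployable" (an adjective) matches the document word "deployment" (a noun) because both map to the representative "deploy", which shows normalization across parts of speech.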
[0141] The relational lexicon can also be used to
generate a group or family of words from one of the words
in the group or family.
[0142] By analogy to an inflectional lexicon, the
relational lexicon can also be used to go from one part of
speech to another within a derivational family, a useful
capability in application areas such as machine translation.
For example, if two languages have counterpart multiword
expressions in which a word or subexpression in a first
language has a different part of speech than the counterpart
word or subexpression in a second language, translation
could be accomplished by first obtaining a counterpart word
or subexpression in the second language that has the same
part of speech as the word or subexpression in the first
language. Then, the relational lexicon could be used to
derive the counterpart word or subexpression in the second
language that has the appropriate part of speech.
[0143] The relational lexicon can also be used as part of
a terminology extractor from monolingual and multilingual
points of view, such as to extract indexes from a text for
accessing a thesaurus or to
types of controlled indexing.

E. Miscellaneous

[0144] The invention has been described in relation 
to software implementations, but the invention might be 
implemented with specialized hardware. 
[0145] The invention has been described in relation 
to implementations using serial processing techniques. 
The invention might also be implemented with parallel 
processing techniques.

Claims

1. A method of grouping a set of words that may occur
in a natural language set, comprising:

automatically obtaining suffix relation data indicating a
relation value for each of a set of relationships between
suffixes that occur in the natural language set; and

automatically clustering the words in the set of words
using the relation values from the suffix relation data, to
obtain group data indicating groups of words; two or more
words in a group having suffixes as in one of the
relationships and, preceding the suffixes, equivalent
substrings.
2. The method of claim 1 in which the relation value for
a relationship is its frequency of occurrence.

3. The method of claim 1 or 2 in which the relationships
between suffixes are pairwise relationships.

4. The method of claim 3 in which the relation value of
each pairwise relationship is the number of pairs of words
that are related to each other by the pair of suffixes in the
relationship.

5. The method of claim 3 in which the act of
automatically clustering the words comprises:

obtaining, for each of a set of pairs of words, a pairwise
similarity value based on the relation value of a suffix pair,
if any, that relates the words to each other; and

performing automatic clustering using the pairwise
similarity values for the pairs of words.

6. The method of claim 5 in which the act of performing
automatic clustering performs complete link clustering.

7. The method of claim 5 or 6 in which the pairwise
similarity value for a pair of words is equal to the greatest
relation value of the relationships between suffixes that
relate the words in the pair to each other.

8. The method of claim 3 in which the natural language
set includes one natural language and the act of
automatically obtaining suffix relation data comprises:

using a lexicon for the language to obtain a word list
indicating the set of words;

using the word list to obtain suffix pair data indicating
pairs of suffixes that relate words in the set of words to
each other; and

for each pair of suffixes indicated by the suffix pair data,
obtaining a relation value indicating a number of times the
suffix pair occurs in the set of words.

9. The method of claim 8 in which the lexicon is an
inflectional lexicon for the language.

10. The method of claim 8 or 9 in which the suffix pair
data further indicate, for each suffix in a pair, a part of
speech; the relation value indicating the number of times
the suffixes in the suffix pair occur in the set of words with
the indicated parts of speech.



11. The method according to claims 1 to 10, further
comprising:

automatically obtaining, for each group of words
indicated by the group data, a representative; and

automatically producing a data structure that can be
accessed with a word in a group to obtain the group's
representative.

12. The method of claim 11 in which the act of
automatically obtaining a representative selects the shortest
word in a group as the representative.

13. The method of claim 11 or 12 in which the data
structure can also be accessed with a group's representative
to obtain a list of words in the group.

14. The method of claim 11, 12 or 13 in which the data
structure is a finite state transducer data structure.
15. A system for grouping a set of words that occur in a
natural language, comprising:

memory for storing data; and

a processor connected for accessing the memory; the
processor operating to:

automatically obtain suffix relation data indicating a
relation value for each of a set of relationships between
suffixes that occur in the natural language; the processor
storing the suffix relation data in memory; and

automatically cluster the words in the set using the
relation values from the suffix relation data, to obtain group
data indicating groups of words; two or more words in a
group having suffixes as in one of the relationships and,
preceding the suffixes, equivalent substrings; the processor
storing the group data in memory.

16. The system of claim 15, further comprising:

an inflectional lexicon stored in memory;

the processor, in automatically obtaining suffix relation
data, accessing the inflectional lexicon in memory.

17. The system of claim 15 or 16 in which the processor
further operates to:

automatically obtain, for each group of words indicated
by the group data, a representative; and

automatically produce a data structure that can be
accessed with a word in a group to obtain the group's
representative; the processor storing the data structure in
memory.

18. The system of claim 17, further comprising a storage
medium access device for accessing a storage medium; the
processor being connected for providing data to the storage
medium access device; the processor further operating to
provide the data structure to the storage medium access
device; the storage medium access device storing the data
structure on the storage medium.

19. An article of manufacture produced by the system of
claim 18; the article of manufacture comprising:

the storage medium; and

the data structure stored on the storage medium.

20. The system of claim 17, 18 or 19 in which the
processor is further connected for establishing connections
with machines over a network; the processor operating to:

establish a connection to a machine over the network;
and

transfer the data structure to the machine over the
network.